大規模言語モデルのアーキテクチャ進化：BERTからGPT、T5へ

Transformerアーキテクチャの三つ組

大規模言語モデルの進化は、パラダイムシフトタスク固有のモデルから「統合的事前学習」への移行を特徴としています。これは、単一のアーキテクチャが複数の自然言語処理（NLP）のニーズに適応できるようになる仕組みです。

この変化の中心にあるのは、文脈内の異なる語の重要性を評価できる自己注意機構（Self-Attention）です：

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

1. エンコーダー専用（BERT）

メカニズム：マスクされた言語モデル（MLM）。
動作特性：双方向的な文脈；モデルは入力された文全体を一度に「見る」ことで、隠れた単語を予測します。
最適な用途：自然言語理解（NLU）、感情分析、固有表現認識（NER）。

2. デコーダー専用（GPT）

メカニズム：自己回帰モデル。
動作特性：左から右への順次処理；前の文脈に基づいて次のトークンを厳密に予測する（因果マスキング）。
最適な用途：自然言語生成（NLG）や創造的な文章作成。これは、GPT-4やLlama 3などの現代の大規模言語モデル（LLM）の基盤となっています。

3. エンコーダー・デコーダー型（T5）

メカニズム：テキスト対テキスト転移トランスフォーマー。
動作特性：エンコーダーが入力文字列を高密度な表現に変換し、デコーダーが目的の文字列を生成します。
最適な用途：翻訳、要約、同等性タスク。

重要な洞察：デコーダーの優位性

業界は大きく、デコーダー専用アーキテクチャに集約しています。これは、ゼロショット状況での優れたスケーリング法則と発現する推論能力によるものです。

VRAMのコンテキストウィンドウへの影響

デコーダー専用モデルでは、KVキャッシュシーケンス長に比例して増加します。10万単語のコンテキストウィンドウは8千単語のものよりも大幅に多くのVRAMを必要とします。そのため、量子化なしで長文コンテキストモデルをローカルに展開することは困難です。

TERMINALbash — 80x24

> Ready. Click "Run" to execute.

Question 1

Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?

Decoders scale more effectively for generative tasks and follow-up instructions via next-token prediction.

Encoders cannot process text bidirectionally.

Decoders require less training data for classification tasks.

Encoders are incompatible with the Self-Attention mechanism.

Question 2

Which architecture treats every NLP task as a "text-to-text" problem?

Encoder-Only (BERT)

Decoder-Only (GPT)

Encoder-Decoder (T5)

Recurrent Neural Networks (RNN)

Challenge: Architectural Bottlenecks

Analyze deployment constraints based on architecture.

If you are building a model for real-time document summarization where the input is very long, explain why a Decoder-only model might be preferred over an Encoder-Decoder model in modern deployments.

Step 1

Identify the architectural bottleneck regarding context processing.

Solution:
Encoder-Decoders must process the entire long input through the encoder, then perform cross-attention in the decoder, which can be computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process everything uniformly. With modern techniques like FlashAttention and KV Cache optimization, scaling the context window in a Decoder-only model is more streamlined and efficient for real-time generation.

Step 2

Justify the preference using Scaling Laws.

Solution:
Decoder-only models have demonstrated highly predictable performance improvements (Scaling Laws) when increasing parameters and training data. This massive scale unlocks "emergent abilities," allowing a single Decoder-only model to perform zero-shot summarization highly effectively without needing the task-specific fine-tuning often required by smaller Encoder-Decoder setups.